Introduction: Home Credit Default Risk Competition

This notebook is intended for those who are new to machine learning competitions or want a gentle introduction to the problem. I purposely avoid jumping into complicated models or joining together lots of data in order to show the basics of how to get started in machine learning! Any comments or suggestions are much appreciated.

In this notebook, we will take an initial look at the Home Credit default risk machine learning competition currently hosted on Kaggle. The objective of this competition is to use historical loan application data to predict whether or not an applicant will be able to repay a loan. This is a standard supervised classification task:

  • Supervised: The labels are included in the training data and the goal is to train a model to learn to predict the labels from the features
  • Classification: The label is a binary variable, 0 (will repay loan on time), 1 (will have difficulty repaying loan)

Data

The data is provided by Home Credit, a service dedicated to providing lines of credit (loans) to the unbanked population. Predicting whether or not a client will repay a loan or have difficulty is a critical business need, and Home Credit is hosting this competition on Kaggle to see what sort of models the machine learning community can develop to help them in this task.

There are 7 different sources of data:

  • application_train/application_test: the main training and testing data with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature SK_ID_CURR. The training application data comes with the TARGET indicating 0: the loan was repaid or 1: the loan was not repaid.
  • bureau: data concerning client's previous credits from other financial institutions. Each previous credit has its own row in bureau, but one loan in the application data can have multiple previous credits.
  • bureau_balance: monthly data about the previous credits in bureau. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit length.
  • previous_application: previous applications for loans at Home Credit of clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature SK_ID_PREV.
  • POS_CASH_BALANCE: monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows.
  • credit_card_balance: monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
  • installments_payments: payment history for previous loans at Home Credit. There is one row for every payment made and one row for every missed payment.

This diagram shows how all of the data is related:

image

Moreover, we are provided with the definitions of all the columns (in HomeCredit_columns_description.csv) and an example of the expected submission file.
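The one-to-many relationship between the application data and a table such as bureau can be sketched with a toy example. This is only an illustration, not something this notebook does later: the values are made up, and the SK_ID_BUREAU column name and the previous_credit_count feature are assumptions introduced here.

```python
import pandas as pd

# Toy stand-ins for application_train and bureau (all values are made up)
app = pd.DataFrame({'SK_ID_CURR': [100002, 100003, 100004],
                    'TARGET': [1, 0, 0]})
bureau = pd.DataFrame({'SK_ID_CURR': [100002, 100002, 100003],
                       'SK_ID_BUREAU': [1, 2, 3]})  # assumed key name

# One loan can have many previous credits, so collapse bureau to one row per loan
prev_counts = (bureau.groupby('SK_ID_CURR').size()
               .rename('previous_credit_count')  # hypothetical feature name
               .reset_index())

# A left join keeps every loan, including those with no previous credits
app_with_counts = app.merge(prev_counts, on='SK_ID_CURR', how='left')
app_with_counts['previous_credit_count'] = (
    app_with_counts['previous_credit_count'].fillna(0).astype(int))

print(app_with_counts)
```

Any table with many rows per loan has to be aggregated in some way like this before it can sit alongside the one-row-per-loan application data.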

In this notebook, we will stick to using only the main application training and testing data. If we want to have any hope of seriously competing, we will eventually need to use all the data, but for now we will stick to one file, which should be more manageable. This will let us establish a baseline that we can then improve upon. With these projects, it's best to build up an understanding of the problem a little at a time rather than diving all the way in and getting completely lost!

Metric: ROC AUC

Once we have a grasp of the data (reading through the column descriptions helps immensely), we need to understand the metric by which our submission is judged. In this case, it is a common classification metric known as the Receiver Operating Characteristic Area Under the Curve (ROC AUC, also sometimes called AUROC).

The ROC AUC may sound intimidating, but it is relatively straightforward once you can get your head around the two individual concepts. The Receiver Operating Characteristic (ROC) curve graphs the true positive rate versus the false positive rate:

image

A single line on the graph indicates the curve for a single model, and movement along a line indicates changing the threshold used for classifying a positive instance. The threshold starts at 0 in the upper right and goes to 1 in the lower left. A curve that is to the left of and above another curve indicates a better model. For example, the blue model is better than the red model, which is better than the black diagonal line, which indicates a naive random guessing model.

The Area Under the Curve (AUC) explains itself by its name! It is simply the area under the ROC curve. (This is the integral of the curve.) This metric is between 0 and 1 with a better model scoring higher. A model that simply guesses at random will have an ROC AUC of 0.5.
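To make the two concepts concrete, here is a minimal sketch that builds the ROC curve by sweeping the threshold and then integrates it with the trapezoid rule. It uses only numpy, and the function names are my own; in practice a library routine such as scikit-learn's roc_auc_score would be used instead.

```python
import numpy as np

def roc_curve_points(y_true, y_score):
    """Return (fpr, tpr) obtained by lowering the threshold one distinct score at a time."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)

    # Sort observations from the highest score to the lowest
    order = np.argsort(-y_score, kind="mergesort")
    y_sorted = y_true[order]
    s_sorted = y_score[order]

    # Only evaluate after the last observation of each run of tied scores
    last_of_run = np.r_[np.where(np.diff(s_sorted))[0], len(s_sorted) - 1]

    # Cumulative true/false positives as the threshold is lowered
    tps = np.cumsum(y_sorted == 1)[last_of_run]
    fps = np.cumsum(y_sorted == 0)[last_of_run]

    # Prepend the (0, 0) point and convert counts to rates
    tpr = np.r_[0.0, tps / tps[-1]]
    fpr = np.r_[0.0, fps / fps[-1]]
    return fpr, tpr

def roc_auc(y_true, y_score):
    """Area under the ROC curve via the trapezoid rule."""
    fpr, tpr = roc_curve_points(y_true, y_score)
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

labels = [0, 0, 1, 1]
print(roc_auc(labels, [0.1, 0.4, 0.35, 0.8]))  # 0.75: one positive is ranked below a negative
print(roc_auc(labels, [0.1, 0.2, 0.8, 0.9]))   # 1.0: every positive outranks every negative
print(roc_auc(labels, [0.5, 0.5, 0.5, 0.5]))   # 0.5: uninformative constant score
```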

When we measure a classifier according to the ROC AUC, we do not generate 0 or 1 predictions, but rather a probability between 0 and 1. This may be confusing because we usually like to think in terms of accuracy, but when we get into problems with imbalanced classes (we will see this is the case), accuracy is not the best metric. For example, if I wanted to build a model that could detect terrorists with 99.9999% accuracy, I would simply make a model that predicted every single person was not a terrorist. Clearly, this would not be effective (the recall would be zero), so we use more advanced metrics such as ROC AUC or the F1 score to more accurately reflect the performance of a classifier. A model with a high ROC AUC will often also achieve a high accuracy at a suitable threshold, but the ROC AUC is a better representation of model performance.
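The accuracy trap can be checked on a small synthetic example. This is a sketch with made-up labels (10 positives out of 1000); pairwise_auc is a helper name I introduce here, based on the fact that the ROC AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counting as one half.

```python
import numpy as np

# Synthetic, heavily imbalanced labels: 10 positives out of 1000 (made-up data)
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A "model" that predicts the majority class for everyone
always_negative = np.zeros(1000, dtype=int)

accuracy = np.mean(always_negative == y_true)
recall = np.sum((always_negative == 1) & (y_true == 1)) / np.sum(y_true == 1)

def pairwise_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

auc = pairwise_auc(y_true, always_negative.astype(float))

print('Accuracy: %0.2f' % accuracy)  # 0.99
print('Recall:   %0.2f' % recall)    # 0.00
print('ROC AUC:  %0.2f' % auc)       # 0.50
```

The accuracy looks excellent, while the recall and the ROC AUC both show that the model has learned nothing.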

Now that we know the background of the data we are using and the metric to maximize, let's get into exploring the data. In this notebook, as mentioned previously, we will stick to the main data sources and simple models which we can build upon in future work.

Follow-up Notebooks

For those looking to keep working on this problem, I have a series of follow-up notebooks:

I'll add more notebooks as I finish them! Thanks for all the comments!

Imports

We are using a typical data science stack: numpy, pandas, sklearn, matplotlib, and seaborn.

In [443]:
# numpy and pandas for data manipulation
import numpy as np
import pandas as pd 

# sklearn preprocessing for dealing with categorical variables
from sklearn.preprocessing import LabelEncoder

# File system management
import os

# Suppress warnings 
import warnings
warnings.filterwarnings('ignore')

# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns

Read in Data

First, we can list all the available data files. There are a total of 10 files: 1 main file for training (with the target), 1 main file for testing (without the target), 1 example submission file, 1 file describing the columns, and 6 other files containing additional information about each loan.

In [444]:
# List files available
print(os.listdir("../input/"))
['application_test.csv', 'sample_submission.csv', 'installments_payments.csv', 'bureau_balance.csv', 'HomeCredit_columns_description.csv', 'bureau.csv', 'credit_card_balance.csv', 'POS_CASH_balance.csv', 'application_train.csv', 'previous_application.csv']
In [445]:
# Training data
app_train = pd.read_csv('../input/application_train.csv')
print('Training data shape: ', app_train.shape)
app_train.head()
Training data shape:  (307511, 122)
Out[445]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 122 columns

The training data has 307511 observations (each one a separate loan) and 122 features (variables) including the TARGET (the label we want to predict).

In [446]:
# Testing data features
app_test = pd.read_csv('../input/application_test.csv')
print('Testing data shape: ', app_test.shape)
app_test.head()
Testing data shape:  (48744, 121)
Out[446]:
SK_ID_CURR NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100001 Cash loans F N Y 0 135000.0 568800.0 20560.5 450000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
1 100005 Cash loans M N Y 0 99000.0 222768.0 17370.0 180000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0
2 100013 Cash loans M Y Y 0 202500.0 663264.0 69777.0 630000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 1.0 4.0
3 100028 Cash loans F N Y 2 315000.0 1575000.0 49018.5 1575000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0
4 100038 Cash loans M Y N 1 180000.0 625500.0 32067.0 625500.0 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN

5 rows × 121 columns

The test set is considerably smaller and lacks a TARGET column.

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an open-ended process where we calculate statistics and make figures to find trends, anomalies, patterns, or relationships within the data. The goal of EDA is to learn what our data can tell us. It generally starts out with a high level overview, then narrows in to specific areas as we find intriguing areas of the data. The findings may be interesting in their own right, or they can be used to inform our modeling choices, such as by helping us decide which features to use.

Examine the Distribution of the Target Column

The target is what we are asked to predict: either a 0 for the loan was repaid on time, or a 1 indicating the client had payment difficulties. We can first examine the number of loans falling into each category.

In [447]:
app_train['TARGET'].value_counts()
Out[447]:
0    282686
1     24825
Name: TARGET, dtype: int64
In [448]:
app_train['TARGET'].astype(int).plot.hist();

From this information, we see this is an imbalanced class problem. There are far more loans that were repaid on time than loans that were not repaid. Once we get into more sophisticated machine learning models, we can weight the classes by their representation in the data to reflect this imbalance.
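As a rough sketch of what weighting the classes by their representation could look like, the counts above can be turned into weights with the common "balanced" heuristic, n_samples / (n_classes * class_count). The variable names are mine, and this is only one possible weighting scheme.

```python
import numpy as np

# Class counts taken from the value_counts output above
counts = np.array([282686, 24825])  # TARGET == 0, TARGET == 1

n_samples = counts.sum()
n_classes = len(counts)

# "Balanced" heuristic: rarer classes receive proportionally larger weights
class_weights = n_samples / (n_classes * counts)

for label, weight in zip([0, 1], class_weights):
    print('TARGET == %d weight: %0.3f' % (label, weight))
```

With these counts, a loan that was not repaid is weighted roughly 11 times more heavily than one that was, so both classes contribute equally in total.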

Examine Missing Values

Next we can look at the number and percentage of missing values in each column.

In [449]:
# Function to calculate missing values by column
def missing_values_table(df):
        # Total missing values
        mis_val = df.isnull().sum()
        
        # Percentage of missing values
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        
        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        
        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        # Sort the table by percentage of missing descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
        # Print some summary information
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
        # Return the dataframe with missing information
        return mis_val_table_ren_columns
In [450]:
# Missing values statistics
missing_values = missing_values_table(app_train)
missing_values.head(20)
Your selected dataframe has 122 columns.
There are 67 columns that have missing values.
Out[450]:
Missing Values % of Total Values
COMMONAREA_MEDI 214865 69.9
COMMONAREA_AVG 214865 69.9
COMMONAREA_MODE 214865 69.9
NONLIVINGAPARTMENTS_MEDI 213514 69.4
NONLIVINGAPARTMENTS_MODE 213514 69.4
NONLIVINGAPARTMENTS_AVG 213514 69.4
FONDKAPREMONT_MODE 210295 68.4
LIVINGAPARTMENTS_MODE 210199 68.4
LIVINGAPARTMENTS_MEDI 210199 68.4
LIVINGAPARTMENTS_AVG 210199 68.4
FLOORSMIN_MODE 208642 67.8
FLOORSMIN_MEDI 208642 67.8
FLOORSMIN_AVG 208642 67.8
YEARS_BUILD_MODE 204488 66.5
YEARS_BUILD_MEDI 204488 66.5
YEARS_BUILD_AVG 204488 66.5
OWN_CAR_AGE 202929 66.0
LANDAREA_AVG 182590 59.4
LANDAREA_MEDI 182590 59.4
LANDAREA_MODE 182590 59.4

When it comes time to build our machine learning models, we will have to fill in these missing values (known as imputation). In later work, we will use models such as XGBoost that can handle missing values with no need for imputation. Another option would be to drop columns with a high percentage of missing values, although it is impossible to know ahead of time if these columns will be helpful to our model. Therefore, we will keep all of the columns for now.
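A minimal sketch of median imputation on a toy column is shown below. The values are made up, and the median is just one possible fill strategy. The point to notice is that the fill value is learned from the training data only and then reused for the testing data.

```python
import numpy as np
import pandas as pd

# Toy frames with missing values (made-up numbers)
train = pd.DataFrame({'OWN_CAR_AGE': [4.0, np.nan, 10.0, np.nan, 7.0]})
test = pd.DataFrame({'OWN_CAR_AGE': [np.nan, 2.0]})

# Learn the fill value for each column from the training data only
medians = train.median()

# Apply the same fill values to both datasets
train_imputed = train.fillna(medians)
test_imputed = test.fillna(medians)

print(train_imputed['OWN_CAR_AGE'].tolist())  # [4.0, 7.0, 10.0, 7.0, 7.0]
print(test_imputed['OWN_CAR_AGE'].tolist())   # [7.0, 2.0]
```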

Column Types

Let's look at the number of columns of each data type. int64 and float64 are numeric variables (which can be either discrete or continuous). object columns contain strings and are categorical features.

In [451]:
# Number of each type of column
app_train.dtypes.value_counts()
Out[451]:
float64    65
int64      41
object     16
dtype: int64

Let's now look at the number of unique entries in each of the object (categorical) columns.

In [452]:
# Number of unique classes in each object column
app_train.select_dtypes('object').apply(pd.Series.nunique, axis = 0)
Out[452]:
NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64

Most of the categorical variables have a relatively small number of unique entries. We will need to find a way to deal with these categorical variables!

Encoding Categorical Variables

Before we go any further, we need to deal with pesky categorical variables. Most machine learning models unfortunately cannot deal with categorical variables directly (some models, such as LightGBM, are exceptions). Therefore, we have to find a way to encode (represent) these variables as numbers before handing them off to the model. There are two main ways to carry out this process:

  • Label encoding: assign an integer to each unique category in a categorical variable. No new columns are created. An example is shown below.

image

  • One-hot encoding: create a new column for each unique category in a categorical variable. Each observation receives a 1 in the column for its corresponding category and a 0 in all other new columns.

image

The problem with label encoding is that it gives the categories an arbitrary ordering. The value assigned to each of the categories is random and does not reflect any inherent aspect of the category. In the example above, programmer receives a 4 and data scientist a 1, but if we did the same process again, the labels could be reversed or completely different. The actual assignment of the integers is arbitrary. Therefore, when we perform label encoding, the model might use the relative value of the feature (for example programmer = 4 and data scientist = 1) to assign weights, which is not what we want. If we only have two unique values for a categorical variable (such as Male/Female), then label encoding is fine, but for more than 2 unique categories, one-hot encoding is the safe option.

There is some debate about the relative merits of these approaches, and some models can deal with label encoded categorical variables with no issues. Here is a good Stack Overflow discussion. I think (and this is just a personal opinion) for categorical variables with many classes, one-hot encoding is the safest approach because it does not impose arbitrary values to categories. The only downside to one-hot encoding is that the number of features (dimensions of the data) can explode with categorical variables with many categories. To deal with this, we can perform one-hot encoding followed by PCA or other dimensionality reduction methods to reduce the number of dimensions (while still trying to preserve information).

In this notebook, we will use Label Encoding for any categorical variables with only 2 categories and One-Hot Encoding for any categorical variables with more than 2 categories. This process may need to change as we get further into the project, but for now, we will see where this gets us. (We will also not use any dimensionality reduction in this notebook but will explore in future iterations).
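The difference between the two encodings can be seen on a toy column. This is only a sketch with made-up categories; pandas category codes stand in for a label encoder here.

```python
import pandas as pd

# Toy categorical column (made-up values)
df = pd.DataFrame({'occupation': ['programmer', 'data scientist',
                                  'engineer', 'programmer']})

# Label encoding: one integer per category, no new columns
# (the integers follow alphabetical order here, which carries no real meaning)
label_encoded = df['occupation'].astype('category').cat.codes

# One-hot encoding: one new 0/1 column per category
one_hot = pd.get_dummies(df['occupation'], prefix='occupation', dtype=int)

print(label_encoded.tolist())   # [2, 0, 1, 2]
print(one_hot.columns.tolist())
print(one_hot.shape)            # (4, 3)
```

The label-encoded column implies data scientist < engineer < programmer, while the one-hot columns make no such claim, at the cost of two extra columns.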

Label Encoding and One-Hot Encoding

Let's implement the policy described above: for any categorical variable (dtype == object) with 2 unique categories, we will use label encoding, and for any categorical variable with more than 2 unique categories, we will use one-hot encoding.

For label encoding, we use the Scikit-Learn LabelEncoder and for one-hot encoding, the pandas get_dummies(df) function.

In [453]:
# Create a label encoder object
le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in app_train:
    if app_train[col].dtype == 'object':
        # If 2 or fewer unique values (note that unique() counts NaN as a value)
        if len(list(app_train[col].unique())) <= 2:
            # Train on the training data
            le.fit(app_train[col])
            # Transform both training and testing data
            app_train[col] = le.transform(app_train[col])
            app_test[col] = le.transform(app_test[col])
            
            # Keep track of how many columns were label encoded
            le_count += 1
            
print('%d columns were label encoded.' % le_count)
3 columns were label encoded.
In [454]:
# one-hot encoding of categorical variables
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)
Training Features shape:  (307511, 243)
Testing Features shape:  (48744, 239)

Aligning Training and Testing Data

There need to be the same features (columns) in both the training and testing data. One-hot encoding has created more columns in the training data because there were some categorical variables with categories not represented in the testing data. To remove the columns in the training data that are not in the testing data, we need to align the dataframes. First we extract the target column from the training data (because this is not in the testing data but we need to keep this information). When we do the align, we must make sure to set axis = 1 to align the dataframes based on the columns and not on the rows!

In [455]:
train_labels = app_train['TARGET']

# Align the training and testing data, keep only columns present in both dataframes
app_train, app_test = app_train.align(app_test, join = 'inner', axis = 1)

# Add the target back in
app_train['TARGET'] = train_labels

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)
Training Features shape:  (307511, 240)
Testing Features shape:  (48744, 239)

The training and testing datasets now have the same features which is required for machine learning. The number of features has grown significantly due to one-hot encoding. At some point we probably will want to try dimensionality reduction (removing features that are not relevant) to reduce the size of the datasets.
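To see exactly what the inner align keeps and drops, here is the same operation on two tiny frames. The column names and values are made up for illustration.

```python
import pandas as pd

# Toy one-hot encoded frames: the category 'y' never appears in the test data
train = pd.DataFrame({'A': [1, 2], 'B_x': [1, 0], 'B_y': [0, 1], 'TARGET': [0, 1]})
test = pd.DataFrame({'A': [3], 'B_x': [1]})

# Keep the labels aside, since TARGET is not a column of the test data
labels = train['TARGET']

# join='inner' with axis=1 keeps only the columns present in both frames
train_aligned, test_aligned = train.align(test, join='inner', axis=1)

# Add the target back to the training data
train_aligned['TARGET'] = labels

print(train_aligned.columns.tolist())  # B_y is gone, TARGET was added back
print(test_aligned.columns.tolist())
```

The rows are untouched; only the B_y column, which has no counterpart in the test data, is removed.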

Back to Exploratory Data Analysis

Anomalies

One problem we always want to be on the lookout for when doing EDA is anomalies within the data. These may be due to mis-typed numbers, errors in measuring equipment, or they could be valid but extreme measurements. One way to support anomalies quantitatively is by looking at the statistics of a column using the describe method. The numbers in the DAYS_BIRTH column are negative because they are recorded relative to the current loan application. To see these stats in years, we can multiply by -1 and divide by the number of days in a year:

In [456]:
(app_train['DAYS_BIRTH'] / -365).describe()
Out[456]:
count    307511.000000
mean         43.936973
std          11.956133
min          20.517808
25%          34.008219
50%          43.150685
75%          53.923288
max          69.120548
Name: DAYS_BIRTH, dtype: float64

Those ages look reasonable. There are no outliers for the age on either the high or low end. How about the days of employment?

In [457]:
app_train['DAYS_EMPLOYED'].describe()
Out[457]:
count    307511.000000
mean      63815.045904
std      141275.766519
min      -17912.000000
25%       -2760.000000
50%       -1213.000000
75%        -289.000000
max      365243.000000
Name: DAYS_EMPLOYED, dtype: float64

That doesn't look right! The maximum value (besides being positive) is about 1000 years!

In [458]:
app_train['DAYS_EMPLOYED'].plot.hist(title = 'Days Employment Histogram');
plt.xlabel('Days Employment');

Just out of curiosity, let's subset the anomalous clients and see if they tend to have higher or lower rates of default than the rest of the clients.

In [459]:
anom = app_train[app_train['DAYS_EMPLOYED'] == 365243]
non_anom = app_train[app_train['DAYS_EMPLOYED'] != 365243]
print('The non-anomalies default on %0.2f%% of loans' % (100 * non_anom['TARGET'].mean()))
print('The anomalies default on %0.2f%% of loans' % (100 * anom['TARGET'].mean()))
print('There are %d anomalous days of employment' % len(anom))
The non-anomalies default on 8.66% of loans
The anomalies default on 5.40% of loans
There are 55374 anomalous days of employment

Well that is extremely interesting! It turns out that the anomalies have a lower rate of default.

Handling the anomalies depends on the exact situation, with no set rules. One of the safest approaches is just to set the anomalies to a missing value and then have them filled in (using Imputation) before machine learning. In this case, since all the anomalies have the exact same value, we want to fill them in with the same value in case all of these loans share something in common. The anomalous values seem to have some importance, so we want to tell the machine learning model if we did in fact fill in these values. As a solution, we will fill in the anomalous values with not a number (np.nan) and then create a new boolean column indicating whether or not the value was anomalous.

In [460]:
# Create an anomalous flag column
app_train['DAYS_EMPLOYED_ANOM'] = app_train["DAYS_EMPLOYED"] == 365243

# Replace the anomalous values with nan
app_train['DAYS_EMPLOYED'] = app_train['DAYS_EMPLOYED'].replace({365243: np.nan})

app_train['DAYS_EMPLOYED'].plot.hist(title = 'Days Employment Histogram');
plt.xlabel('Days Employment');

The distribution looks to be much more in line with what we would expect, and we also have created a new column to tell the model that these values were originally anomalous (because we will have to fill in the nans with some value, probably the median of the column). The other columns with DAYS in the dataframe look to be about what we expect with no obvious outliers.

As an extremely important note, anything we do to the training data we also have to do to the testing data. Let's make sure to create the new column and fill in the existing column with np.nan in the testing data.

In [461]:
app_test['DAYS_EMPLOYED_ANOM'] = app_test["DAYS_EMPLOYED"] == 365243
app_test["DAYS_EMPLOYED"] = app_test["DAYS_EMPLOYED"].replace({365243: np.nan})

print('There are %d anomalies in the test data out of %d entries' % (app_test["DAYS_EMPLOYED_ANOM"].sum(), len(app_test)))
There are 9274 anomalies in the test data out of 48744 entries

Correlations

Now that we have dealt with the categorical variables and the outliers, let's continue with the EDA. One way to try and understand the data is by looking for correlations between the features and the target. We can calculate the Pearson correlation coefficient between every variable and the target using the .corr dataframe method.

The correlation coefficient is not the greatest method to represent "relevance" of a feature, but it does give us an idea of possible relationships within the data. Some general interpretations of the absolute value of the correlation coefficient are:

  • .00-.19 “very weak”
  • .20-.39 “weak”
  • .40-.59 “moderate”
  • .60-.79 “strong”
  • .80-1.0 “very strong”
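The interpretation bands listed above can be written as a small helper. This is just a convenience sketch; the function name is my own, and the cut-offs are the rough guidelines from the list rather than hard rules.

```python
def correlation_strength(r):
    """Map a correlation coefficient to the rough verbal label from the list above."""
    magnitude = abs(r)
    if magnitude < 0.20:
        return 'very weak'
    elif magnitude < 0.40:
        return 'weak'
    elif magnitude < 0.60:
        return 'moderate'
    elif magnitude < 0.80:
        return 'strong'
    return 'very strong'

# The sign is ignored: only the magnitude determines the label
print(correlation_strength(0.05))   # very weak
print(correlation_strength(-0.45))  # moderate
print(correlation_strength(0.85))   # very strong
```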
In [462]:
# Find correlations with the target and sort
correlations = app_train.corr()['TARGET'].sort_values()

# Display correlations
print('Most Positive Correlations:\n', correlations.tail(15))
print('\nMost Negative Correlations:\n', correlations.head(15))
Most Positive Correlations:
 OCCUPATION_TYPE_Laborers                             0.043019
FLAG_DOCUMENT_3                                      0.044346
REG_CITY_NOT_LIVE_CITY                               0.044395
FLAG_EMP_PHONE                                       0.045982
NAME_EDUCATION_TYPE_Secondary / secondary special    0.049824
REG_CITY_NOT_WORK_CITY                               0.050994
DAYS_ID_PUBLISH                                      0.051457
CODE_GENDER_M                                        0.054713
DAYS_LAST_PHONE_CHANGE                               0.055218
NAME_INCOME_TYPE_Working                             0.057481
REGION_RATING_CLIENT                                 0.058899
REGION_RATING_CLIENT_W_CITY                          0.060893
DAYS_EMPLOYED                                        0.074958
DAYS_BIRTH                                           0.078239
TARGET                                               1.000000
Name: TARGET, dtype: float64

Most Negative Correlations:
 EXT_SOURCE_3                           -0.178919
EXT_SOURCE_2                           -0.160472
EXT_SOURCE_1                           -0.155317
NAME_EDUCATION_TYPE_Higher education   -0.056593
CODE_GENDER_F                          -0.054704
NAME_INCOME_TYPE_Pensioner             -0.046209
DAYS_EMPLOYED_ANOM                     -0.045987
ORGANIZATION_TYPE_XNA                  -0.045987
FLOORSMAX_AVG                          -0.044003
FLOORSMAX_MEDI                         -0.043768
FLOORSMAX_MODE                         -0.043226
EMERGENCYSTATE_MODE_No                 -0.042201
HOUSETYPE_MODE_block of flats          -0.040594
AMT_GOODS_PRICE                        -0.039645
REGION_POPULATION_RELATIVE             -0.037227
Name: TARGET, dtype: float64

Let's take a look at some of the more notable correlations. DAYS_BIRTH has the most positive correlation (apart from TARGET, because the correlation of a variable with itself is always 1!). Looking at the documentation, DAYS_BIRTH is the age of the client in days at the time of the loan, recorded as a negative number (for whatever reason!). The correlation is positive, but because the values of this feature are negative, this means that as the client gets older, they are less likely to default on their loan (i.e. the target == 0). That's a little confusing, so we will take the absolute value of the feature, and then the correlation will be negative.

Effect of Age on Repayment

In [463]:
# Find the correlation of the positive days since birth and target
app_train['DAYS_BIRTH'] = abs(app_train['DAYS_BIRTH'])
app_train['DAYS_BIRTH'].corr(app_train['TARGET'])
Out[463]:
-0.07823930830984513

With age expressed as a positive number, there is a negative linear relationship with the target, meaning that as clients get older, they tend to repay their loans on time more often.

Let's start looking at this variable. First, we can make a histogram of the age. We will put the x axis in years to make the plot a little more understandable.

In [464]:
# Set the style of plots
plt.style.use('fivethirtyeight')

# Plot the distribution of ages in years
plt.hist(app_train['DAYS_BIRTH'] / 365, edgecolor = 'k', bins = 25)
plt.title('Age of Client'); plt.xlabel('Age (years)'); plt.ylabel('Count');

By itself, the distribution of age does not tell us much other than that there are no outliers as all the ages are reasonable. To visualize the effect of the age on the target, we will next make a kernel density estimation plot (KDE) colored by the value of the target. A kernel density estimate plot shows the distribution of a single variable and can be thought of as a smoothed histogram (it is created by computing a kernel, usually a Gaussian, at each data point and then averaging all the individual kernels to develop a single smooth curve). We will use the seaborn kdeplot for this graph.
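The "average of kernels" idea can be written out by hand in a few lines. This is a sketch with made-up ages and a hand-picked bandwidth of 3 years; seaborn chooses the bandwidth automatically, so its curve will not match this one exactly.

```python
import numpy as np

def gaussian_kde_manual(samples, grid, bandwidth):
    """Average one Gaussian kernel per data point, evaluated at every grid value."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)

    # Standardized distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth

    # Gaussian density centered on each sample
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))

    # The density estimate is the mean of the individual kernels
    return kernels.mean(axis=1)

ages = np.array([25.0, 27.0, 31.0, 45.0, 52.0, 60.0])  # made-up ages in years
grid = np.linspace(0, 100, 2001)
density = gaussian_kde_manual(ages, grid, bandwidth=3.0)

# A valid density integrates to (approximately) 1
area = np.sum(density) * (grid[1] - grid[0])
print('Area under the KDE: %0.3f' % area)
```

A larger bandwidth gives a smoother curve and a smaller one a bumpier curve, which is why two KDE plots of the same data can look different.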

In [465]:
plt.figure(figsize = (10, 8))

# KDE plot of loans that were repaid on time
sns.kdeplot(app_train.loc[app_train['TARGET'] == 0, 'DAYS_BIRTH'] / 365, label = 'target == 0')

# KDE plot of loans which were not repaid on time
sns.kdeplot(app_train.loc[app_train['TARGET'] == 1, 'DAYS_BIRTH'] / 365, label = 'target == 1')

# Labeling of plot
plt.xlabel('Age (years)'); plt.ylabel('Density'); plt.title('Distribution of Ages');

The target == 1 curve skews towards the younger end of the range. Although this is not a strong correlation (-0.07 correlation coefficient), this variable is likely going to be useful in a machine learning model because it does appear to be related to the target. Let's look at this relationship in another way: average failure to repay loans by age bracket.

To make this graph, first we cut the age category into bins of 5 years each. Then, for each bin, we calculate the average value of the target, which tells us the ratio of loans that were not repaid in each age category.

In [466]:
# Age information into a separate dataframe
age_data = app_train[['TARGET', 'DAYS_BIRTH']].copy()
age_data['YEARS_BIRTH'] = age_data['DAYS_BIRTH'] / 365

# Bin the age data
age_data['YEARS_BINNED'] = pd.cut(age_data['YEARS_BIRTH'], bins = np.linspace(20, 70, num = 11))
age_data.head(10)
Out[466]:
TARGET DAYS_BIRTH YEARS_BIRTH YEARS_BINNED
0 1 9461 25.920548 (25.0, 30.0]
1 0 16765 45.931507 (45.0, 50.0]
2 0 19046 52.180822 (50.0, 55.0]
3 0 19005 52.068493 (50.0, 55.0]
4 0 19932 54.608219 (50.0, 55.0]
5 0 16941 46.413699 (45.0, 50.0]
6 0 13778 37.747945 (35.0, 40.0]
7 0 18850 51.643836 (50.0, 55.0]
8 0 20099 55.065753 (55.0, 60.0]
9 0 14469 39.641096 (35.0, 40.0]
In [467]:
# Group by the bin and calculate averages
age_groups  = age_data.groupby('YEARS_BINNED').mean()
age_groups
Out[467]:
TARGET DAYS_BIRTH YEARS_BIRTH
YEARS_BINNED
(20.0, 25.0] 0.123036 8532.795625 23.377522
(25.0, 30.0] 0.111436 10155.219250 27.822518
(30.0, 35.0] 0.102814 11854.848377 32.479037
(35.0, 40.0] 0.089414 13707.908253 37.555913
(40.0, 45.0] 0.078491 15497.661233 42.459346
(45.0, 50.0] 0.074171 17323.900441 47.462741
(50.0, 55.0] 0.066968 19196.494791 52.593136
(55.0, 60.0] 0.055314 20984.262742 57.491131
(60.0, 65.0] 0.052737 22780.547460 62.412459
(65.0, 70.0] 0.037270 24292.614340 66.555108
In [468]:
plt.figure(figsize = (8, 8))

# Graph the age bins and the average of the target as a bar plot
plt.bar(age_groups.index.astype(str), 100 * age_groups['TARGET'])

# Plot labeling
plt.xticks(rotation = 75); plt.xlabel('Age Group (years)'); plt.ylabel('Failure to Repay (%)')
plt.title('Failure to Repay by Age Group');

There is a clear trend: younger applicants are more likely to not repay the loan! The rate of failure to repay is above 10% for the youngest three age groups and below 5% for the oldest age group.

This is information that could be directly used by the bank: because younger clients are less likely to repay the loan, maybe they should be provided with more guidance or financial planning tips. This does not mean the bank should discriminate against younger clients, but it would be smart to take precautionary measures to help younger clients pay on time.

Exterior Sources

The 3 variables with the strongest negative correlations with the target are EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3. According to the documentation, these features represent a "normalized score from external data source". I'm not sure what this exactly means, but it may be a cumulative sort of credit rating made using numerous sources of data.

Let's take a look at these variables.

First, we can show the correlations of the EXT_SOURCE features with the target and with each other.

In [469]:
# Extract the EXT_SOURCE variables and show correlations
#ext_data = app_train.sample(100000)
ext_data = app_train[['TARGET', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]
ext_data_corrs = ext_data.corr()
ext_data_corrs
Out[469]:
TARGET EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH
TARGET 1.000000 -0.155317 -0.160472 -0.178919 -0.078239
EXT_SOURCE_1 -0.155317 1.000000 0.213982 0.186846 0.600610
EXT_SOURCE_2 -0.160472 0.213982 1.000000 0.109167 0.091996
EXT_SOURCE_3 -0.178919 0.186846 0.109167 1.000000 0.205478
DAYS_BIRTH -0.078239 0.600610 0.091996 0.205478 1.000000
In [470]:
plt.figure(figsize = (8, 6))

# Heatmap of correlations
sns.heatmap(ext_data_corrs, cmap = plt.cm.RdYlBu_r, vmin = -0.25, annot = True, vmax = 0.6)
plt.title('Correlation Heatmap');

All three EXT_SOURCE features have negative correlations with the target, indicating that as the value of the EXT_SOURCE increases, the client is more likely to repay the loan. We can also see that DAYS_BIRTH is positively correlated with EXT_SOURCE_1, indicating that one of the factors in this score may be the client's age.

Next we can look at the distribution of each of these features colored by the value of the target. This will let us visualize the effect of this variable on the target.

In [471]:
plt.figure(figsize = (10, 12))

# iterate through the sources
for i, source in enumerate(['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']):
    
    # create a new subplot for each source
    plt.subplot(3, 1, i + 1)
    # plot repaid loans
    sns.kdeplot(app_train.loc[app_train['TARGET'] == 0, source], label = 'target == 0')
    # plot loans that were not repaid
    sns.kdeplot(app_train.loc[app_train['TARGET'] == 1, source], label = 'target == 1')
    
    # Label the plots
    plt.title('Distribution of %s by Target Value' % source)
    plt.xlabel('%s' % source); plt.ylabel('Density');
    
plt.tight_layout(h_pad = 2.5)
    

EXT_SOURCE_3 displays the greatest difference between the values of the target. We can clearly see that this feature has some relationship to the likelihood of an applicant to repay a loan. The relationship is not very strong (in fact, all of these correlations are considered very weak), but these variables will still be useful for a machine learning model to predict whether or not an applicant will repay a loan on time.

Pairs Plot

As a final exploratory plot, we can make a pairs plot of the EXT_SOURCE variables and the DAYS_BIRTH variable. The pairs plot is a great exploration tool because it lets us see relationships between multiple pairs of variables as well as distributions of single variables. Here we use the seaborn visualization library and its PairGrid class to create a pairs plot with scatterplots on the upper triangle, kernel density estimates on the diagonal, and 2D kernel density plots on the lower triangle.

If you don't understand this code, that's all right! Plotting in Python can be overly complex, and for anything beyond the simplest graphs, I usually find an existing implementation and adapt the code (don't repeat yourself)!

In [472]:
# Copy the data for plotting
plot_data = ext_data.drop(columns = ['DAYS_BIRTH']).copy()

# Add in the age of the client in years
plot_data['YEARS_BIRTH'] = age_data['YEARS_BIRTH']

# Drop na values and keep the rows whose index label is at most 100000
plot_data = plot_data.dropna().loc[:100000, :]

# Function to calculate correlation coefficient between two columns
def corr_func(x, y, **kwargs):
    r = np.corrcoef(x, y)[0][1]
    ax = plt.gca()
    ax.annotate("r = {:.2f}".format(r),
                xy=(.2, .8), xycoords=ax.transAxes,
                size = 20)

# Create the pairgrid object
grid = sns.PairGrid(data = plot_data, height = 3, diag_sharey=False,
                    hue = 'TARGET', 
                    vars = [x for x in list(plot_data.columns) if x != 'TARGET'])

# Upper is a scatter plot
grid.map_upper(plt.scatter, alpha = 0.2)

# Diagonal is a kernel density estimate
grid.map_diag(sns.kdeplot)

# Bottom is density plot
grid.map_lower(sns.kdeplot, cmap = plt.cm.OrRd_r);

plt.suptitle('Ext Source and Age Features Pairs Plot', size = 32, y = 1.05);

In this plot, red indicates loans that were not repaid and blue indicates loans that were repaid. We can see the different relationships within the data. There does appear to be a moderate positive linear relationship between EXT_SOURCE_1 and DAYS_BIRTH (or equivalently YEARS_BIRTH), indicating that this feature may take into account the age of the client.

Evaluating Various Learning Models on the Data Taken Directly from the Exploratory Data Analysis

Drawing a Sample of the Data

In [473]:
import timeit
import lime
import lime.lime_tabular
In [474]:
X = app_train.sample(100000)

X.describe()
Out[474]:
SK_ID_CURR NAME_CONTRACT_TYPE FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE ... WALLSMATERIAL_MODE_Block WALLSMATERIAL_MODE_Mixed WALLSMATERIAL_MODE_Monolithic WALLSMATERIAL_MODE_Others WALLSMATERIAL_MODE_Panel WALLSMATERIAL_MODE_Stone, brick WALLSMATERIAL_MODE_Wooden EMERGENCYSTATE_MODE_No EMERGENCYSTATE_MODE_Yes TARGET
count 100000.000000 100000.000000 100000.000000 100000.000000 100000.000000 1.000000e+05 1.000000e+05 99995.000000 9.990400e+04 100000.000000 ... 100000.000000 100000.000000 100000.00000 100000.000000 100000.000000 100000.000000 100000.000000 100000.000000 100000.000000 100000.000000
mean 278407.484360 0.093350 0.340180 0.691180 0.417670 1.683507e+05 5.998412e+05 27156.462428 5.391729e+05 0.020827 ... 0.030380 0.007160 0.00589 0.005450 0.215510 0.209620 0.017540 0.517070 0.007930 0.080900
std 102688.442994 0.290924 0.473772 0.462009 0.720768 1.086230e+05 4.027973e+05 14532.556773 3.695235e+05 0.013804 ... 0.171631 0.084314 0.07652 0.073623 0.411178 0.407039 0.131273 0.499711 0.088697 0.272683
min 100003.000000 0.000000 0.000000 0.000000 0.000000 2.565000e+04 4.500000e+04 1980.000000 4.500000e+04 0.000290 ... 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 189638.500000 0.000000 0.000000 0.000000 0.000000 1.125000e+05 2.700000e+05 16542.000000 2.385000e+05 0.010006 ... 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 278599.000000 0.000000 0.000000 1.000000 0.000000 1.440000e+05 5.175000e+05 24939.000000 4.500000e+05 0.018850 ... 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000
75% 366924.500000 0.000000 1.000000 1.000000 1.000000 2.025000e+05 8.086500e+05 34650.000000 6.795000e+05 0.028663 ... 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000
max 456255.000000 1.000000 1.000000 1.000000 14.000000 1.350000e+07 4.050000e+06 258025.500000 4.050000e+06 0.072508 ... 1.000000 1.000000 1.00000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 240 columns

Imputing Missing Values in the Relevant Data

In [475]:
(1-X.isna().mean()).unique()
Out[475]:
array([1.     , 0.99995, 0.99904, 0.81966, 0.34017, 0.99998, 0.43568,
       0.99788, 0.80002, 0.49183, 0.41574, 0.51148, 0.33506, 0.30162,
       0.46776, 0.49646, 0.50194, 0.32181, 0.40645, 0.31664, 0.4977 ,
       0.30607, 0.44841, 0.51645, 0.99682, 0.86416])
In [476]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan,strategy='median')
imputer.fit(X)
X.loc[:,:] = imputer.transform(X)
In [477]:
y = X['TARGET']
X = X.drop(columns=['TARGET'])
In [478]:
1-X.isna().mean()
Out[478]:
SK_ID_CURR                         1.0
NAME_CONTRACT_TYPE                 1.0
FLAG_OWN_CAR                       1.0
FLAG_OWN_REALTY                    1.0
CNT_CHILDREN                       1.0
                                  ... 
WALLSMATERIAL_MODE_Stone, brick    1.0
WALLSMATERIAL_MODE_Wooden          1.0
EMERGENCYSTATE_MODE_No             1.0
EMERGENCYSTATE_MODE_Yes            1.0
DAYS_EMPLOYED_ANOM                 1.0
Length: 240, dtype: float64
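To make the imputation step concrete, here is a minimal sketch of what median imputation does to a single column. It is written in plain Python on a made-up column rather than with `SimpleImputer`, and the helper name `impute_median` is hypothetical; `SimpleImputer(strategy='median')` applies the same idea column by column.

```python
import math
from statistics import median

def impute_median(values):
    """Replace NaN entries with the median of the observed (non-NaN) entries."""
    observed = [v for v in values if not math.isnan(v)]
    fill = median(observed)
    return [fill if math.isnan(v) else v for v in values]

# Toy column, not taken from the dataset
toy_column = [0.2, float("nan"), 0.5, 0.9, float("nan")]
imputed = impute_median(toy_column)
print(imputed)
```

After this step every column is fully populated, which is what the all-1.0 completeness ratios above confirm for the real data.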

Data Preprocessing

In [479]:
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3, random_state=666)

# list of column labels
feat = list(X_train.columns)
In [480]:
from sklearn import preprocessing
std_scale = preprocessing.StandardScaler().fit(X_train)
X_train_std = std_scale.transform(X_train)
X_test_std = std_scale.transform(X_test)
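The scaler is fitted on the training split only and then reused on the test split, so that no information from the test data enters the preprocessing. The sketch below reproduces that logic in plain Python on toy values; the helper names are hypothetical, and it assumes the same convention as `StandardScaler`, which divides by the population standard deviation.

```python
from statistics import fmean, pstdev

def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from training values only."""
    return fmean(train_values), pstdev(train_values)

def standardize(values, mean, std):
    """Apply (x - mean) / std using statistics learned on the training data."""
    return [(v - mean) / std for v in values]

# Toy values, not taken from the dataset
train = [10.0, 20.0, 30.0]
test = [40.0]

mean, std = fit_standardizer(train)
train_std = standardize(train, mean, std)
test_std = standardize(test, mean, std)
print(train_std, test_std)
```

The standardized training values have mean 0, while the test value is expressed in units of the training standard deviation, so it can fall well outside the range seen during training.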

Helper Functions for the Task

In [581]:
from sklearn import metrics


def apprentissage(classifier, X_train, y_train, X_test, y_test, param, cross_val=5, scoring=None, has_feature_importance=None, feat_nm=None):
    
    #Initializing the grid search
    classifier_gs = model_selection.GridSearchCV(classifier, scoring=scoring, param_grid=param, cv=cross_val, n_jobs=10)
    
    #Training the model with a capture of elapsed time
    start_time = timeit.default_timer()
    classifier_gs.fit(X_train, y_train)
    elapsed = timeit.default_timer() - start_time
    
    print("The computation time for training the model is {:.2f}s".format(elapsed))
    
    #Initializing the prediction over the test labels
    y_prob = classifier_gs.predict_proba(X_test)[:,1]
    y_pred = np.where(y_prob > 0.5, 1, 0)
    fpr, tpr, thr = metrics.roc_curve(y_test, y_prob)
    
    roc_auc = metrics.auc(fpr, tpr)
    f1_sc = metrics.f1_score(y_test, y_pred)
    rcall = metrics.recall_score(y_test, y_pred)
    cmatrx = metrics.confusion_matrix(y_test, y_pred)
    time = elapsed
    bus_metrics = metrics.fbeta_score(y_test, y_pred, beta=2)
    
    #Feature importance
    if has_feature_importance:
        # Keep the 12 most important features, sorted by increasing importance
        feat_importances = classifier_gs.best_estimator_.feature_importances_
        sorted_idx = feat_importances.argsort()[-12:]
        # Use the names passed in feat_nm rather than the global X
        feature_names = np.array(feat_nm)[sorted_idx]

        fig, ax = plt.subplots()
        ax.barh(feature_names, feat_importances[sorted_idx])
        ax.set_title("Feature Importances")
        plt.savefig('./Feat_importance.png', bbox_inches='tight')
        plt.show()
    
    #Model interpretation
    explainer = lime.lime_tabular.LimeTabularExplainer(X_train, 
                                                   mode= 'classification',
                                                   training_labels= y_train,
                                                   feature_names= feat_nm
                                                  )
    ind=1
    exp= explainer.explain_instance(X_train[ind,:], classifier_gs.predict_proba, num_features=10)
    exp.show_in_notebook(show_table=True) 

    #Building the results
    res={}
    for string in param.keys():
        res[string]=param[string]
    for string in classifier_gs.best_params_.keys():
        res[string+'_opt']=classifier_gs.best_params_[string]
    
    res.update({'cv':cross_val, 'time':time, 'roc_auc':roc_auc, 'recall':rcall, 'f1_score':f1_sc, 'confusion_matrix':cmatrx, 'business_metrics':bus_metrics})
    return res

# Incrementally builds a summary table of the main results of each training run
def affichage_resultats(apprentissage, df, classifier_name):
    line = pd.Series(apprentissage, name=classifier_name)
    # DataFrame.append was removed in pandas 2.0; concatenate the new row instead
    df = pd.concat([df, line.to_frame().T])
    return df

# Extracts a relevant subset of the summary table and sorts it by a given metric
def xtract_perfo(recap_df, col, score):
    df = recap_df[col]
    df = df.sort_values(by=score, ascending=False)
    return df

Naive Approaches

In [482]:
from sklearn import dummy

computation = apprentissage(
    dummy.DummyClassifier(),
    X_train_std, y_train,
    X_test_std, y_test,
    {}, feat_nm=feat
)
The computation time for training the model is 0.67s
In [483]:
recap_perfo = pd.DataFrame()
recap_perfo = affichage_resultats(computation, recap_perfo, 'Dummy Classifier')
recap_perfo
Out[483]:
business_metrics confusion_matrix cv f1_score recall roc_auc time
Dummy Classifier 0.083099 [[25307, 2293], [2199, 201]] 5.0 0.082141 0.08375 0.500335 0.673406
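The scores in this table can be reproduced by hand from the confusion matrix, which helps when reading the later tables. The sketch below assumes scikit-learn's layout for binary labels, `[[TN, FP], [FN, TP]]`, and uses only the Dummy Classifier counts shown above; the helper name `scores_from_confusion` is made up for illustration.

```python
def scores_from_confusion(cm, beta=2.0):
    """Derive recall, precision, F1 and F-beta from a 2x2 confusion matrix.

    Assumes scikit-learn's layout for binary labels: [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    def f_beta(b):
        # Weighted harmonic mean of precision and recall
        denom = b ** 2 * precision + recall
        return (1 + b ** 2) * precision * recall / denom if denom else 0.0

    return {
        "recall": recall,
        "precision": precision,
        "f1": f_beta(1.0),
        "f_beta": f_beta(beta),
    }

# Confusion matrix of the Dummy Classifier reported above
dummy_scores = scores_from_confusion([[25307, 2293], [2199, 201]])
print(dummy_scores)
```

The recall, F1 and F-beta values obtained this way match the `recall`, `f1_score` and `business_metrics` columns of the table (0.08375, 0.082141 and 0.083099).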

Logistic Regression

In [484]:
from sklearn import linear_model
In [485]:
computation = apprentissage(
    linear_model.LogisticRegression(solver = 'liblinear'),
    X_train_std, y_train,
    X_test_std, y_test,
    {'C': np.logspace(-3, 3, 5)}, feat_nm=feat
)
The computation time for training the model is 176.80s
In [486]:
recap_perfo = affichage_resultats(computation, recap_perfo, 'Regression Logistique')
recap_perfo
Out[486]:
business_metrics confusion_matrix cv f1_score recall roc_auc time C C_opt
Dummy Classifier 0.083099 [[25307, 2293], [2199, 201]] 5.0 0.082141 0.083750 0.500335 0.673406 NaN NaN
Regression Logistique 0.002602 [[27597, 3], [2395, 5]] 5.0 0.004153 0.002083 0.739528 176.802237 [0.001, 0.03162277660168379, 1.0, 31.622776601... 0.001

Decision Tree

In [487]:
from sklearn import tree
In [488]:
computation = apprentissage(
    tree.DecisionTreeClassifier(),
    X_train_std, y_train,
    X_test_std, y_test,
    {'max_depth':np.arange(2,10)}, feat_nm=feat
)
The computation time for training the model is 8.47s
In [489]:
recap_perfo=affichage_resultats(computation, recap_perfo,'Arbre de décision')
recap_perfo
Out[489]:
business_metrics confusion_matrix cv f1_score recall roc_auc time C C_opt max_depth max_depth_opt
Dummy Classifier 0.083099 [[25307, 2293], [2199, 201]] 5.0 0.082141 0.083750 0.500335 0.673406 NaN NaN NaN NaN
Regression Logistique 0.002602 [[27597, 3], [2395, 5]] 5.0 0.004153 0.002083 0.739528 176.802237 [0.001, 0.03162277660168379, 1.0, 31.622776601... 0.001 NaN NaN
Arbre de décision 0.000000 [[27600, 0], [2400, 0]] 5.0 0.000000 0.000000 0.644316 8.473992 NaN NaN [2, 3, 4, 5, 6, 7, 8, 9] 2.0

Random forests

In [490]:
from sklearn import ensemble
In [491]:
computation = apprentissage(
    ensemble.RandomForestClassifier(oob_score=True, n_estimators=100),
    X_train_std, y_train, 
    X_test_std, y_test,
    {'max_depth':np.arange(2,10)}, 
    feat_nm=feat,
    has_feature_importance=True
)
The computation time for training the model is 49.89s
In [492]:
recap_perfo=affichage_resultats(computation, recap_perfo,'Random Forests')
recap_perfo
Out[492]:
business_metrics confusion_matrix cv f1_score recall roc_auc time C C_opt max_depth max_depth_opt
Dummy Classifier 0.083099 [[25307, 2293], [2199, 201]] 5.0 0.082141 0.083750 0.500335 0.673406 NaN NaN NaN NaN
Regression Logistique 0.002602 [[27597, 3], [2395, 5]] 5.0 0.004153 0.002083 0.739528 176.802237 [0.001, 0.03162277660168379, 1.0, 31.622776601... 0.001 NaN NaN
Arbre de décision 0.000000 [[27600, 0], [2400, 0]] 5.0 0.000000 0.000000 0.644316 8.473992 NaN NaN [2, 3, 4, 5, 6, 7, 8, 9] 2.0
Random Forests 0.000000 [[27600, 0], [2400, 0]] 5.0 0.000000 0.000000 0.706579 49.885437 NaN NaN [2, 3, 4, 5, 6, 7, 8, 9] 2.0

Extra Trees

In [493]:
computation = apprentissage(
    ensemble.ExtraTreesClassifier(n_estimators=100),
    X_train_std, y_train,
    X_test_std, y_test,
    {'max_depth':np.arange(2,10)}, 
    feat_nm=feat
)
The computation time for training the model is 43.32s
In [494]:
recap_perfo=affichage_resultats(computation, recap_perfo,'Extra Trees')
recap_perfo
Out[494]:
business_metrics confusion_matrix cv f1_score recall roc_auc time C C_opt max_depth max_depth_opt
Dummy Classifier 0.083099 [[25307, 2293], [2199, 201]] 5.0 0.082141 0.083750 0.500335 0.673406 NaN NaN NaN NaN
Regression Logistique 0.002602 [[27597, 3], [2395, 5]] 5.0 0.004153 0.002083 0.739528 176.802237 [0.001, 0.03162277660168379, 1.0, 31.622776601... 0.001 NaN NaN
Arbre de décision 0.000000 [[27600, 0], [2400, 0]] 5.0 0.000000 0.000000 0.644316 8.473992 NaN NaN [2, 3, 4, 5, 6, 7, 8, 9] 2.0
Random Forests 0.000000 [[27600, 0], [2400, 0]] 5.0 0.000000 0.000000 0.706579 49.885437 NaN NaN [2, 3, 4, 5, 6, 7, 8, 9] 2.0
Extra Trees 0.000000 [[27600, 0], [2400, 0]] 5.0 0.000000 0.000000 0.694185 43.322373 NaN NaN [2, 3, 4, 5, 6, 7, 8, 9] 2.0

XGradient Boosting

In [495]:
import xgboost as xgb
In [496]:
computation = apprentissage(
    xgb.XGBClassifier(n_estimators=100),
    X_train_std, y_train,
    X_test_std, y_test,
    {'learning_rate': np.logspace(-3, 3, 7), 'max_depth':np.arange(2,4)},
    feat_nm=feat,
    has_feature_importance=True
)
The computation time for training the model is 125.24s
In [497]:
recap_perfo_basic = affichage_resultats(computation, recap_perfo,'XGBoosting')
recap_perfo_basic
Out[497]:
business_metrics confusion_matrix cv f1_score recall roc_auc time C C_opt max_depth max_depth_opt learning_rate learning_rate_opt
Dummy Classifier 0.083099 [[25307, 2293], [2199, 201]] 5.0 0.082141 0.083750 0.500335 0.673406 NaN NaN NaN NaN NaN NaN
Regression Logistique 0.002602 [[27597, 3], [2395, 5]] 5.0 0.004153 0.002083 0.739528 176.802237 [0.001, 0.03162277660168379, 1.0, 31.622776601... 0.001 NaN NaN NaN NaN
Arbre de décision 0.000000 [[27600, 0], [2400, 0]] 5.0 0.000000 0.000000 0.644316 8.473992 NaN NaN [2, 3, 4, 5, 6, 7, 8, 9] 2.0 NaN NaN
Random Forests 0.000000 [[27600, 0], [2400, 0]] 5.0 0.000000 0.000000 0.706579 49.885437 NaN NaN [2, 3, 4, 5, 6, 7, 8, 9] 2.0 NaN NaN
Extra Trees 0.000000 [[27600, 0], [2400, 0]] 5.0 0.000000 0.000000 0.694185 43.322373 NaN NaN [2, 3, 4, 5, 6, 7, 8, 9] 2.0 NaN NaN
XGBoosting 0.013987 [[27575, 25], [2373, 27]] 5.0 0.022023 0.011250 0.748577 125.235332 NaN NaN [2, 3] 3.0 [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0] 0.1
In [498]:
recap_perfo_basic.to_excel('recap_perfo_basic_v441.xlsx',sheet_name='basic')
In [552]:
perfo_basic = xtract_perfo(recap_perfo_basic, 
                       ['roc_auc','f1_score','max_depth_opt', 'C_opt', 'learning_rate_opt'],
                       'roc_auc'
                      )
perfo_basic.to_excel('perfo_basic.xlsx',sheet_name='basic')
perfo_basic
Out[552]:
roc_auc f1_score max_depth_opt C_opt learning_rate_opt
XGBoosting 0.748577 0.022023 3.0 NaN 0.1
Regression Logistique 0.739528 0.004153 NaN 0.001 NaN
Random Forests 0.706579 0.000000 2.0 NaN NaN
Extra Trees 0.694185 0.000000 2.0 NaN NaN
Arbre de décision 0.644316 0.000000 2.0 NaN NaN
Dummy Classifier 0.500335 0.082141 NaN NaN NaN
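In the table above, Random Forests, Extra Trees and the decision tree reach a ROC AUC between 0.64 and 0.71 while their F1 score is exactly 0. These two numbers are not contradictory: ROC AUC depends only on how the predicted probabilities rank the clients, whereas F1 and recall depend on the 0.5 cut-off applied in `apprentissage`. The toy sketch below uses made-up scores and illustrative helper names (`roc_auc`, `recall_at`) to show a ranking that is mostly right even though no score exceeds 0.5.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count for 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def recall_at(labels, scores, threshold=0.5):
    """Share of positives whose score is strictly above the threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s > threshold)
    return tp / sum(labels)

# Toy data: positives tend to get higher scores, but every score is below 0.5
labels = [0, 0, 0, 0, 1, 1]
scores = [0.02, 0.05, 0.08, 0.20, 0.15, 0.30]

auc = roc_auc(labels, scores)
rec = recall_at(labels, scores)
rec_low = recall_at(labels, scores, threshold=0.1)
print(auc, rec, rec_low)
```

Here the AUC is 0.875 while recall at the 0.5 threshold is 0; lowering the threshold to 0.1 recovers both positives. With only about 8% of positives in the data, predicted probabilities that rarely exceed 0.5 are a plausible explanation for the all-zero rows, although the notebook does not check this directly.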

Introducing the Business Metric

Defining the Business Metric

Complex Version of the Business Metric

In [499]:
## TP*(Cost related to ) TODO: factor out the call to metrics.confusion_matrix

# def cost_based_score(y, y_pred, p=0.5, r=0.05, r0=0.015, has_opp_cost=True):
#     pcp = 100.0
#     oc = int(has_opp_cost==True)
# #    [TP FN], [FP TN] = metrics.confusion_matrix(y, y_pred).ravel()
# #    TP, FN, FP, TN = metrics.confusion_matrix(y, y_pred)
#     return [metrics.confusion_matrix(y, y_pred)[0][0] * (pcp*((1 + r0) - oc*(p*(1+r) - (1-p)*(1+r)) 
#                       ) 
#                                                          ) + 
#             metrics.confusion_matrix(y, y_pred)[0][1] * (pcp*((p*(1+r) - (1-p)*(1+r)) - oc*(1 + r0) 
#                  ) 
#                                                          ) + 
#             metrics.confusion_matrix(y, y_pred)[1][0] * (pcp*((1+r) - oc*(1 + r0) 
#                       ) 
#                                                          ) + 
#             metrics.confusion_matrix(y, y_pred)[1][1] * (pcp*((1 + r0) - oc*(1+r) 
#                       ) 
#                                                          ) 
#             ] /(metrics.confusion_matrix(y, y_pred).sum()*pcp)

Simple Version of the Business Metric

In [500]:
# from sklearn import metrics, model_selection  # Temporary
# business_metrics = metrics.make_scorer(cost_based_score, greater_is_better=True)

from sklearn.metrics import fbeta_score, make_scorer

business_metrics = make_scorer(fbeta_score, beta=2)
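For reference, the scorer above wraps the F-beta score with beta = 2. The definition below is the standard one, written out as a reminder rather than quoted from the notebook; $P$ is the precision and $R$ the recall for the positive class (TARGET = 1).

```latex
\begin{gathered}
F_\beta = (1 + \beta^2)\,\frac{P \cdot R}{\beta^2 P + R},
\qquad P = \frac{TP}{TP + FP},
\qquad R = \frac{TP}{TP + FN} \\
F_2 = \frac{5\,P\,R}{4P + R} = \frac{5\,TP}{5\,TP + 4\,FN + FP}
\end{gathered}
```

With beta = 2, each false negative (a default that was missed) weighs four times as much as a false positive in the denominator, which is presumably why this score was chosen as the simple business metric.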

Logistic Regression

In [501]:
computation = apprentissage(
    linear_model.LogisticRegression(solver = 'liblinear'),
    X_train_std, y_train, 
    X_test_std, y_test,
    {'C': np.logspace(-3, 3, 5)},
    feat_nm=feat,
    scoring=business_metrics
)
The computation time for training the model is 238.53s
In [502]:
recap_perfo = pd.DataFrame(columns=computation.keys())
recap_perfo = affichage_resultats(computation, recap_perfo,'Regression Logistique')
recap_perfo
Out[502]:
C C_opt cv time roc_auc recall f1_score confusion_matrix business_metrics
Regression Logistique [0.001, 0.03162277660168379, 1.0, 31.622776601... 31.622777 5 238.527968 0.739821 0.014583 0.028226 [[27555, 45], [2365, 35]] 0.018079

Random forests

In [503]:
computation = apprentissage(
    ensemble.RandomForestClassifier(oob_score=True, n_estimators=100),
    X_train_std, y_train,
    X_test_std, y_test,
    {'max_depth':np.arange(2,10)},
    has_feature_importance=True,
    feat_nm=feat,
    scoring=business_metrics
)
The computation time for training the model is 49.07s
In [504]:
recap_perfo = affichage_resultats(computation, recap_perfo,'Random Forests')
recap_perfo
Out[504]:
C C_opt cv time roc_auc recall f1_score confusion_matrix business_metrics max_depth max_depth_opt
Regression Logistique [0.001, 0.03162277660168379, 1.0, 31.622776601... 31.622777 5 238.527968 0.739821 0.014583 0.028226 [[27555, 45], [2365, 35]] 0.018079 NaN NaN
Random Forests NaN NaN 5 49.070976 0.700680 0.000000 0.000000 [[27600, 0], [2400, 0]] 0.000000 [2, 3, 4, 5, 6, 7, 8, 9] 2.0

Extra Trees

In [505]:
computation = apprentissage(
    ensemble.ExtraTreesClassifier(n_estimators=100),
    X_train_std, y_train,
    X_test_std, y_test,
    {'max_depth':np.arange(2,10)},
    scoring=business_metrics
)
The computation time for training the model is 45.06s
In [506]:
recap_perfo = affichage_resultats(computation, recap_perfo,'Extra Trees')
recap_perfo
Out[506]:
C C_opt cv time roc_auc recall f1_score confusion_matrix business_metrics max_depth max_depth_opt
Regression Logistique [0.001, 0.03162277660168379, 1.0, 31.622776601... 31.622777 5 238.527968 0.739821 0.014583 0.028226 [[27555, 45], [2365, 35]] 0.018079 NaN NaN
Random Forests NaN NaN 5 49.070976 0.700680 0.000000 0.000000 [[27600, 0], [2400, 0]] 0.000000 [2, 3, 4, 5, 6, 7, 8, 9] 2.0
Extra Trees NaN NaN 5 45.063417 0.694722 0.000000 0.000000 [[27600, 0], [2400, 0]] 0.000000 [2, 3, 4, 5, 6, 7, 8, 9] 2.0

XGradient Boosting

In [507]:
computation = apprentissage(
    xgb.XGBClassifier(n_estimators=100),
    X_train_std, y_train,
    X_test_std, y_test,
    {'learning_rate': np.logspace(-3, 0, 5), 'max_depth':np.arange(2,4)},
    has_feature_importance=True,
    feat_nm=feat,
    scoring=business_metrics
)
The computation time for training the model is 121.35s
In [508]:
recap_perfo_metier = affichage_resultats(computation, recap_perfo, 'XGBoosting')
recap_perfo_metier
Out[508]:
C C_opt cv time roc_auc recall f1_score confusion_matrix business_metrics max_depth max_depth_opt learning_rate learning_rate_opt
Regression Logistique [0.001, 0.03162277660168379, 1.0, 31.622776601... 31.622777 5 238.527968 0.739821 0.014583 0.028226 [[27555, 45], [2365, 35]] 0.018079 NaN NaN NaN NaN
Random Forests NaN NaN 5 49.070976 0.700680 0.000000 0.000000 [[27600, 0], [2400, 0]] 0.000000 [2, 3, 4, 5, 6, 7, 8, 9] 2.0 NaN NaN
Extra Trees NaN NaN 5 45.063417 0.694722 0.000000 0.000000 [[27600, 0], [2400, 0]] 0.000000 [2, 3, 4, 5, 6, 7, 8, 9] 2.0 NaN NaN
XGBoosting NaN NaN 5 121.354708 0.716376 0.064583 0.107452 [[27270, 330], [2245, 155]] 0.076847 [2, 3] 3.0 [0.001, 0.005623413251903491, 0.03162277660168... 1.0
In [535]:
recap_perfo_metier.to_excel('recap_perfo_metier_v441.xlsx',sheet_name='metier')
In [553]:
perfo_metier = xtract_perfo(recap_perfo_metier, 
                       ['business_metrics','roc_auc','f1_score','max_depth_opt', 'C_opt', 'learning_rate_opt'],
                       'business_metrics'
                      )
perfo_metier.to_excel('perfo_metier.xlsx',sheet_name='metier')
perfo_metier
Out[553]:
business_metrics roc_auc f1_score max_depth_opt C_opt learning_rate_opt
XGBoosting 0.076847 0.716376 0.107452 3.0 NaN 1.0
Regression Logistique 0.018079 0.739821 0.028226 NaN 31.622777 NaN
Random Forests 0.000000 0.700680 0.000000 2.0 NaN NaN
Extra Trees 0.000000 0.694722 0.000000 2.0 NaN NaN

Feature Engineering - Adding 4 new variables

In [510]:
# Work on a copy so that the original feature matrix X is not modified
X2 = X.copy()

X2['CREDIT_ANNUITY_RATIO'] = X2['AMT_CREDIT'] / X2['AMT_ANNUITY']
X2["credit_income_ration"] = X2["AMT_CREDIT"] / X2["AMT_INCOME_TOTAL"]
X2["annuity_income_ratio"] = X2["AMT_ANNUITY"] / X2["AMT_INCOME_TOTAL"]
X2["credit_good_ratio"] = X2["AMT_CREDIT"] / X2["AMT_GOODS_PRICE"]
In [511]:
# Replace infinite ratios (division by zero) with 0; np.NINF was removed in NumPy 2.0
X2 = X2.replace([np.inf, -np.inf], 0)
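The new features are simple ratios between amounts, and the replacement step guards against a zero denominator, which in pandas produces an infinite value instead of raising an error. The sketch below shows the same idea in plain Python on made-up records; `safe_ratio` is a hypothetical helper, not part of the notebook.

```python
def safe_ratio(numerator, denominator, fill=0.0):
    """Return numerator / denominator, or `fill` if the denominator is zero.

    Mirrors the notebook's choice of mapping infinite ratios to 0.
    """
    if denominator == 0:
        return fill
    return numerator / denominator

# Toy loan records (values are illustrative, not taken from the dataset)
loans = [
    {"AMT_CREDIT": 500000.0, "AMT_ANNUITY": 25000.0, "AMT_INCOME_TOTAL": 125000.0},
    {"AMT_CREDIT": 270000.0, "AMT_ANNUITY": 0.0, "AMT_INCOME_TOTAL": 90000.0},
]

for loan in loans:
    loan["CREDIT_ANNUITY_RATIO"] = safe_ratio(loan["AMT_CREDIT"], loan["AMT_ANNUITY"])
    loan["credit_income_ration"] = safe_ratio(loan["AMT_CREDIT"], loan["AMT_INCOME_TOTAL"])

print(loans)
```

In the sample used here, the `describe()` output shows strictly positive minimums for AMT_ANNUITY, AMT_INCOME_TOTAL and AMT_GOODS_PRICE, so the replacement acts as a safeguard rather than changing any value.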
In [512]:
X2.head()
X2.describe()
Out[512]:
SK_ID_CURR NAME_CONTRACT_TYPE FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE ... WALLSMATERIAL_MODE_Panel WALLSMATERIAL_MODE_Stone, brick WALLSMATERIAL_MODE_Wooden EMERGENCYSTATE_MODE_No EMERGENCYSTATE_MODE_Yes DAYS_EMPLOYED_ANOM CREDIT_ANNUITY_RATIO credit_income_ration annuity_income_ratio credit_good_ratio
count 100000.000000 100000.000000 100000.000000 100000.000000 100000.000000 1.000000e+05 1.000000e+05 100000.000000 1.000000e+05 100000.000000 ... 100000.000000 100000.000000 100000.000000 100000.000000 100000.000000 100000.000000 100000.000000 100000.000000 100000.000000 100000.000000
mean 278407.484360 0.093350 0.340180 0.691180 0.417670 1.683507e+05 5.998412e+05 27156.351555 5.390873e+05 0.020827 ... 0.215510 0.209620 0.017540 0.517070 0.007930 0.180340 21.610932 3.963736 0.181235 1.122667
std 102688.442994 0.290924 0.473772 0.462009 0.720768 1.086230e+05 4.027973e+05 14532.201910 3.693564e+05 0.013804 ... 0.411178 0.407039 0.131273 0.499711 0.088697 0.384472 7.835403 2.690322 0.094699 0.126679
min 100003.000000 0.000000 0.000000 0.000000 0.000000 2.565000e+04 4.500000e+04 1980.000000 4.500000e+04 0.000290 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 8.036791 0.103741 0.006000 0.150000
25% 189638.500000 0.000000 0.000000 0.000000 0.000000 1.125000e+05 2.700000e+05 16542.000000 2.385000e+05 0.010006 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 15.606490 2.000920 0.114920 1.000000
50% 278599.000000 0.000000 0.000000 1.000000 0.000000 1.440000e+05 5.175000e+05 24939.000000 4.500000e+05 0.018850 ... 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 20.000000 3.281250 0.163380 1.118800
75% 366924.500000 0.000000 1.000000 1.000000 1.000000 2.025000e+05 8.086500e+05 34650.000000 6.795000e+05 0.028663 ... 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 27.218291 5.200000 0.229943 1.198000
max 456255.000000 1.000000 1.000000 1.000000 14.000000 1.350000e+07 4.050000e+06 258025.500000 4.050000e+06 0.072508 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 59.474377 84.736842 1.875965 6.000000

8 rows × 244 columns

In [513]:
1-X2.isna().mean()
Out[513]:
SK_ID_CURR              1.0
NAME_CONTRACT_TYPE      1.0
FLAG_OWN_CAR            1.0
FLAG_OWN_REALTY         1.0
CNT_CHILDREN            1.0
                       ... 
DAYS_EMPLOYED_ANOM      1.0
CREDIT_ANNUITY_RATIO    1.0
credit_income_ration    1.0
annuity_income_ratio    1.0
credit_good_ratio       1.0
Length: 244, dtype: float64

Data Preprocessing

In [514]:
from sklearn import model_selection
X2_train, X2_test, y_train, y_test = model_selection.train_test_split(X2, y, test_size=0.3, random_state=666)

# list of column labels
feat = list(X2_train.columns)
In [515]:
from sklearn import preprocessing
std_scale = preprocessing.StandardScaler().fit(X2_train)
X2_train_std = std_scale.transform(X2_train)
X2_test_std = std_scale.transform(X2_test)

Logistic Regression

In [516]:
computation = apprentissage(
    linear_model.LogisticRegression(solver = 'liblinear'),
    X2_train_std, y_train, 
    X2_test_std, y_test,
    {'C': np.logspace(-3, 3, 5)},
    feat_nm=feat,
    scoring=business_metrics
)
The computation time for training the model is 200.52s
In [517]:
recap_perfo = pd.DataFrame(columns=computation.keys())
recap_perfo = affichage_resultats(computation, recap_perfo,'Regression Logistique')
recap_perfo
Out[517]:
C C_opt cv time roc_auc recall f1_score confusion_matrix business_metrics
Regression Logistique [0.001, 0.03162277660168379, 1.0, 31.622776601... 1.0 5 200.520529 0.741718 0.015833 0.030547 [[27550, 50], [2362, 38]] 0.019612

Random forests

In [518]:
computation = apprentissage(
    ensemble.RandomForestClassifier(oob_score=True, n_estimators=100),
    X2_train_std, y_train, 
    X2_test_std, y_test,
    {'max_depth':np.arange(2,10)},
    feat_nm=feat,
    has_feature_importance=True,
    scoring=business_metrics
)
The computation time for training the model is 49.69s
In [519]:
recap_perfo = affichage_resultats(computation, recap_perfo,'Random Forests')
recap_perfo
Out[519]:
C C_opt cv time roc_auc recall f1_score confusion_matrix business_metrics max_depth max_depth_opt
Regression Logistique [0.001, 0.03162277660168379, 1.0, 31.622776601... 1.0 5 200.520529 0.741718 0.015833 0.030547 [[27550, 50], [2362, 38]] 0.019612 NaN NaN
Random Forests NaN NaN 5 49.687325 0.708015 0.000000 0.000000 [[27600, 0], [2400, 0]] 0.000000 [2, 3, 4, 5, 6, 7, 8, 9] 2.0

Extra Trees

In [520]:
# Note: this call passes X_train_std / X_test_std, i.e. the data without the new ratio features
computation = apprentissage(
    ensemble.ExtraTreesClassifier(n_estimators=100),
    X_train_std, y_train,
    X_test_std, y_test,
    {'max_depth':np.arange(2,10)},
    feat_nm=feat,
    scoring=business_metrics
)
The computation time for training the model is 43.24s
In [521]:
recap_perfo=affichage_resultats(computation, recap_perfo,'Extra Trees')
recap_perfo
Out[521]:
C C_opt cv time roc_auc recall f1_score confusion_matrix business_metrics max_depth max_depth_opt
Regression Logistique [0.001, 0.03162277660168379, 1.0, 31.622776601... 1.0 5 200.520529 0.741718 0.015833 0.030547 [[27550, 50], [2362, 38]] 0.019612 NaN NaN
Random Forests NaN NaN 5 49.687325 0.708015 0.000000 0.000000 [[27600, 0], [2400, 0]] 0.000000 [2, 3, 4, 5, 6, 7, 8, 9] 2.0
Extra Trees NaN NaN 5 43.235018 0.687073 0.000000 0.000000 [[27600, 0], [2400, 0]] 0.000000 [2, 3, 4, 5, 6, 7, 8, 9] 2.0

XGradient Boosting

In [522]:
computation = apprentissage(
    xgb.XGBClassifier(n_estimators=100), 
    X2_train_std, y_train,
    X2_test_std, y_test,
    {'learning_rate': np.logspace(-3, 0, 5), 'max_depth':np.arange(2,4)},
    feat_nm=feat,
    has_feature_importance=True,
    scoring=business_metrics
)
The computation time for training the model is 119.64s
In [523]:
recap_perfo_feat=affichage_resultats(computation, recap_perfo, 'XGBoosting')
recap_perfo_feat
Out[523]:
model                  C                                                  C_opt  cv  time        roc_auc   recall    f1_score  confusion_matrix              business_metrics  max_depth                 max_depth_opt  learning_rate                                      learning_rate_opt
Regression Logistique  [0.001, 0.03162277660168379, 1.0, 31.622776601...  1.0    5   200.520529  0.741718  0.015833  0.030547  [[27550, 50], [2362, 38]]     0.019612          NaN                       NaN            NaN                                                NaN
Random Forests         NaN                                                NaN    5   49.687325   0.708015  0.000000  0.000000  [[27600, 0], [2400, 0]]       0.000000          [2, 3, 4, 5, 6, 7, 8, 9]  2.0            NaN                                                NaN
Extra Trees            NaN                                                NaN    5   43.235018   0.687073  0.000000  0.000000  [[27600, 0], [2400, 0]]       0.000000          [2, 3, 4, 5, 6, 7, 8, 9]  2.0            NaN                                                NaN
XGBoosting             NaN                                                NaN    5   119.638144  0.724017  0.069583  0.111556  [[27173, 427], [2233, 167]]   0.081911          [2, 3]                    3.0            [0.001, 0.005623413251903491, 0.03162277660168...  1.0
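
The `learning_rate` column is truncated in the output above. The grid passed to the model, `np.logspace(-3, 0, 5)`, contains five values evenly spaced on a log10 scale between 0.001 and 1, and it is crossed with two tree depths. The pure-Python sketch below reproduces those values. The `logspace` function is a hypothetical stand-in for the NumPy call, and how `apprentissage` searches the grid is not shown in this section.

```python
from itertools import product

def logspace(start, stop, num):
    """Pure-Python equivalent of np.logspace(start, stop, num) in base 10."""
    step = (stop - start) / (num - 1)
    return [10 ** (start + i * step) for i in range(num)]

learning_rates = logspace(-3, 0, 5)
max_depths = list(range(2, 4))  # same values as np.arange(2, 4)

# Every (learning_rate, max_depth) pair contained in the grid
candidates = list(product(learning_rates, max_depths))

print([round(lr, 6) for lr in learning_rates])  # [0.001, 0.005623, 0.031623, 0.177828, 1.0]
print(len(candidates))                          # 10
```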
In [524]:
recap_perfo_feat.to_excel('recap_perfo_newFeatures_v441.xlsx',sheet_name='features')
In [554]:
perfo_feat = xtract_perfo(
    recap_perfo_feat,
    ['business_metrics', 'roc_auc', 'f1_score', 'max_depth_opt', 'C_opt', 'learning_rate_opt'],
    'business_metrics'
)
perfo_feat.to_excel('perfo_feat.xlsx',sheet_name='feat')
perfo_feat
Out[554]:
model                  business_metrics  roc_auc   f1_score  max_depth_opt  C_opt  learning_rate_opt
XGBoosting             0.081911          0.724017  0.111556  3.0            NaN    1.0
Regression Logistique  0.019612          0.741718  0.030547  NaN            1.0    NaN
Random Forests         0.000000          0.708015  0.000000  2.0            NaN    NaN
Extra Trees            0.000000          0.687073  0.000000  2.0            NaN    NaN

Handling the class imbalance problem

Logistic regression

In [525]:
from sklearn import linear_model
In [526]:
computation = apprentissage(
    linear_model.LogisticRegression(solver = 'liblinear', class_weight='balanced'),
    X2_train_std, y_train, X2_test_std, y_test,
    {'C': np.logspace(-3, 3, 5)},
    feat_nm=feat,
    scoring=business_metrics
)
The computation time for training the model is 254.25s
In [527]:
recap_perfo = pd.DataFrame(columns=computation.keys())
recap_perfo = affichage_resultats(computation, recap_perfo,'Regression Logistique')
recap_perfo
Out[527]:
model                  C                                                  C_opt  cv  time        roc_auc   recall    f1_score  confusion_matrix              business_metrics
Regression Logistique  [0.001, 0.03162277660168379, 1.0, 31.622776601...  0.001  5   254.253108  0.743663  0.680833  0.257384  [[18937, 8663], [766, 1634]]  0.410615
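
For reference, scikit-learn's `class_weight='balanced'` gives each class the weight n_samples / (n_classes * n_samples_in_class), so the rare class counts for more in the loss. The sketch below applies that formula by hand. The labels are illustrative: they reuse the class counts of the test split shown in the confusion matrices (27600 negatives, 2400 positives) because the training-set counts are not shown in this section. The helper name `balanced_weights` is hypothetical.

```python
from collections import Counter

def balanced_weights(labels):
    """Per-class weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Illustrative labels with the same class ratio as the test split (not y_train)
y_demo = [0] * 27600 + [1] * 2400
weights = balanced_weights(y_demo)
print(weights)
```

With these counts a defaulter weighs 6.25 and a non-defaulter about 0.54, which matches the jump in recall from 0.016 to 0.68 at the price of many more false positives.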

Random forests

In [582]:
computation = apprentissage(
    ensemble.RandomForestClassifier(oob_score=True, n_estimators=100, class_weight='balanced'),
    X2_train_std, y_train,
    X2_test_std, y_test,     
    {'max_depth':np.arange(2,6)},
    has_feature_importance=True,
    feat_nm=feat,
    scoring=business_metrics
)
The computation time for training the model is 23.62s
In [529]:
recap_perfo = affichage_resultats(computation, recap_perfo,'Random Forests')
recap_perfo
Out[529]:
model                  C                                                  C_opt  cv  time        roc_auc   recall    f1_score  confusion_matrix              business_metrics  max_depth                 max_depth_opt
Regression Logistique  [0.001, 0.03162277660168379, 1.0, 31.622776601...  0.001  5   254.253108  0.743663  0.680833  0.257384  [[18937, 8663], [766, 1634]]  0.410615          NaN                       NaN
Random Forests         NaN                                                NaN    5   25.382872   0.727041  0.662500  0.246283  [[18678, 8922], [810, 1590]]  0.395286          [2, 3, 4, 5]              5.0

Extra Trees

In [530]:
computation = apprentissage(
    ensemble.ExtraTreesClassifier(n_estimators=100, class_weight='balanced'),
    X_train_std, y_train,
    X_test_std, y_test,
    {'max_depth':np.arange(2,10)},
    feat_nm=feat,
    scoring=business_metrics
)
The computation time for training the model is 51.24s
In [531]:
recap_perfo = affichage_resultats(computation, recap_perfo,'Extra Trees')
recap_perfo
Out[531]:
model                  C                                                  C_opt  cv  time        roc_auc   recall    f1_score  confusion_matrix              business_metrics  max_depth                 max_depth_opt
Regression Logistique  [0.001, 0.03162277660168379, 1.0, 31.622776601...  0.001  5   254.253108  0.743663  0.680833  0.257384  [[18937, 8663], [766, 1634]]  0.410615          NaN                       NaN
Random Forests         NaN                                                NaN    5   25.382872   0.727041  0.662500  0.246283  [[18678, 8922], [810, 1590]]  0.395286          [2, 3, 4, 5]              5.0
Extra Trees            NaN                                                NaN    5   51.236213   0.706586  0.621250  0.232189  [[18648, 8952], [909, 1491]]  0.371950          [2, 3, 4, 5, 6, 7, 8, 9]  9.0

XGBoost (extreme gradient boosting)

In [532]:
# Ratio of negative to positive training samples, used to up-weight the positive class
scale = (y_train==0).sum() / (y_train==1).sum()

computation = apprentissage(
    xgb.XGBClassifier(scale_pos_weight=scale, n_estimators=100),
    X2_train_std, y_train, 
    X2_test_std, y_test,
    {'learning_rate': np.logspace(-3, 0, 5), 'max_depth':np.arange(2,4)},
    has_feature_importance=True,
    feat_nm=feat,
    scoring=business_metrics
)
The computation time for training the model is 122.97s
In [533]:
recap_perfo_cw = affichage_resultats(computation, recap_perfo, 'XGBoosting')
recap_perfo_cw
Out[533]:
model                  C                                                  C_opt  cv  time        roc_auc   recall    f1_score  confusion_matrix              business_metrics  max_depth                 max_depth_opt  learning_rate                                      learning_rate_opt
Regression Logistique  [0.001, 0.03162277660168379, 1.0, 31.622776601...  0.001  5   254.253108  0.743663  0.680833  0.257384  [[18937, 8663], [766, 1634]]  0.410615          NaN                       NaN            NaN                                                NaN
Random Forests         NaN                                                NaN    5   25.382872   0.727041  0.662500  0.246283  [[18678, 8922], [810, 1590]]  0.395286          [2, 3, 4, 5]              5.0            NaN                                                NaN
Extra Trees            NaN                                                NaN    5   51.236213   0.706586  0.621250  0.232189  [[18648, 8952], [909, 1491]]  0.371950          [2, 3, 4, 5, 6, 7, 8, 9]  9.0            NaN                                                NaN
XGBoosting             NaN                                                NaN    5   122.965981  0.757579  0.678333  0.273522  [[19724, 7876], [772, 1628]]  0.426089          [2, 3]                    3.0            [0.001, 0.005623413251903491, 0.03162277660168...  0.177828
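
XGBoost has no `class_weight='balanced'` option, so the cell above passes the negative-to-positive ratio as `scale_pos_weight` instead. The sketch below computes that ratio on illustrative labels (the test-split class counts from the confusion matrices, not the real `y_train`) and checks that it equals the ratio between the two "balanced" class weights. The helper name `neg_pos_ratio` is hypothetical.

```python
def neg_pos_ratio(labels):
    """Number of negatives (0) divided by number of positives (1)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("no positive samples, the ratio is undefined")
    return n_neg / n_pos

# Illustrative counts only: 27600 negatives and 2400 positives
y_demo = [0] * 27600 + [1] * 2400
scale = neg_pos_ratio(y_demo)
print(scale)  # 11.5

# Same relative emphasis as balanced weights: w_pos / w_neg = n_neg / n_pos
n = len(y_demo)
w_neg, w_pos = n / (2 * 27600), n / (2 * 2400)
print(round(w_pos / w_neg, 6))
```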
In [534]:
recap_perfo_cw.to_excel('recap_perfo_classWeights_v44.xlsx',sheet_name='balanced')
In [555]:
perfo_cw = xtract_perfo(
    recap_perfo_cw,
    ['business_metrics', 'roc_auc', 'f1_score', 'max_depth_opt', 'C_opt', 'learning_rate_opt'],
    'business_metrics'
)
perfo_cw.to_excel('perfo_cw.xlsx',sheet_name='cw')
perfo_cw
Out[555]:
model                  business_metrics  roc_auc   f1_score  max_depth_opt  C_opt  learning_rate_opt
XGBoosting             0.426089          0.757579  0.273522  3.0            NaN    0.177828
Regression Logistique  0.410615          0.743663  0.257384  NaN            0.001  NaN
Random Forests         0.395286          0.727041  0.246283  5.0            NaN    NaN
Extra Trees            0.371950          0.706586  0.232189  9.0            NaN    NaN
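
The definition of `business_metrics` is not shown in this section. As an observation only, the values reported above are reproduced to six decimals by an F-beta score with beta = 2 computed from the confusion matrices, i.e. a score that weights recall more heavily than precision. The sketch below checks this. The helper `fbeta_from_cm` is hypothetical, and the [[TN, FP], [FN, TP]] layout is assumed.

```python
def fbeta_from_cm(cm, beta=2.0):
    """F-beta from a 2x2 confusion matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    b2 = beta ** 2
    # F_beta = (1 + b^2) * TP / ((1 + b^2) * TP + b^2 * FN + FP)
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

# (confusion matrix, reported business_metrics) copied from the tables above
rows = {
    "XGBoosting": ([[19724, 7876], [772, 1628]], 0.426089),
    "Regression Logistique": ([[18937, 8663], [766, 1634]], 0.410615),
    "Random Forests": ([[18678, 8922], [810, 1590]], 0.395286),
    "Extra Trees": ([[18648, 8952], [909, 1491]], 0.371950),
}
for name, (cm, reported) in rows.items():
    print(name, round(fbeta_from_cm(cm), 6), reported)
```

If the metric is indeed an F2 score, a missed defaulter (false negative) costs four times as much as a wrongly refused client (false positive) in the denominator.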

Evolution of selected metrics across the model evaluation phases

Evolution of AUC

In [576]:
# Collect roc_auc for each evaluation phase (one column per phase)
evol_auc = pd.DataFrame()
evol_auc["basic"] = perfo_basic["roc_auc"]
evol_auc["metier"] = perfo_metier["roc_auc"]
evol_auc["feat"] = perfo_feat["roc_auc"]
evol_auc["class_weight"] = perfo_cw["roc_auc"]

# Keep only the models that have a score in every phase
evol_auc = evol_auc.dropna(axis="rows")

# Plot one line per model across the phases
plt.figure(figsize=(8, 8))
for model_name, scores in evol_auc.iterrows():
    # markersize only has a visible effect when a marker is set
    plt.plot(scores, marker='o', markersize=8, label=model_name)
plt.title('Evolution of AUC with phases of models fitting')
plt.legend()

plt.savefig('./Evol_auc.png', bbox_inches='tight')

Evolution of f1_score

In [577]:
# Collect f1_score for each evaluation phase (one column per phase)
evol_f1 = pd.DataFrame()
evol_f1["basic"] = perfo_basic["f1_score"]
evol_f1["metier"] = perfo_metier["f1_score"]
evol_f1["feat"] = perfo_feat["f1_score"]
evol_f1["class_weight"] = perfo_cw["f1_score"]

# Keep only the models that have a score in every phase
evol_f1 = evol_f1.dropna(axis="rows")

# Plot one line per model across the phases
plt.figure(figsize=(8, 8))
for model_name, scores in evol_f1.iterrows():
    # markersize only has a visible effect when a marker is set
    plt.plot(scores, marker='o', markersize=8, label=model_name)
plt.title('Evolution of F1-score with phases of models fitting')
plt.legend()

plt.savefig('./Evol_f1.png', bbox_inches='tight')

Evolution of business_metrics

In [578]:
# Collect business_metrics for each evaluation phase (one column per phase)
evol_bm = pd.DataFrame()
evol_bm["metier"] = perfo_metier["business_metrics"]
evol_bm["feat"] = perfo_feat["business_metrics"]
evol_bm["class_weight"] = perfo_cw["business_metrics"]

# Keep only the models that have a score in every phase
evol_bm = evol_bm.dropna(axis="rows")

# Plot one line per model across the phases
plt.figure(figsize=(8, 8))
for model_name, scores in evol_bm.iterrows():
    # markersize only has a visible effect when a marker is set
    plt.plot(scores, marker='o', markersize=8, label=model_name)
plt.title('Evolution of business metrics with phases of models fitting')
plt.legend()

plt.savefig('./Evol_bm.png', bbox_inches='tight')